Updated chunking_document. #65

PalmPalm7 · 2024-07-02T18:04:17Z

Demonstrated new chunking methods in place of RecursiveCharacterTextSplitter()

Updates:

Applied the document-specific text splitter from Langchain in place of the original naive version.
Made heuristic changes for markdown files, notably using regex to trim markdown tables so that a whole table is more likely to fit within the limited context window.
For the updated chunk_document() function, see Chunking_Demo.ipynb, which demonstrates chunking with server_ctx_size=4096 and chunk_word_count=1024. Granite 7b has a 4k context window.
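
The table-trimming heuristic mentioned above could look roughly like the sketch below. It is only an illustration: the function name trim_markdown_table and the exact patterns are assumptions, not necessarily what chunking.py does. The idea is to shrink the padding in a markdown table so that more of the table fits into one chunk.

```python
import re


def trim_markdown_table(text: str) -> str:
    """Shrink padding in markdown tables (hypothetical helper)."""
    # Collapse runs of dashes in front of a pipe (separator rows) to one dash.
    text = re.sub(r"-{2,}\|", "-|", text)
    # Collapse runs of spaces in front of a pipe (padded cells) to one space.
    text = re.sub(r" {2,}\|", " |", text)
    return text


table = (
    "| name      | value   |\n"
    "|-----------|---------|\n"
    "| alpha     | 1       |\n"
)
trimmed = trim_markdown_table(table)
print(trimmed)
```

The trimmed table keeps the same cells but uses far fewer characters, which matters when the chunk size is counted in characters.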

Saved for later:

Because of the "soft dependency" freeze, introducing Magika should wait.

In a future PR, an efficient library to detect file type could be added for document-specific chunking and heuristics.
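
For illustration only, document-specific routing with a markdown default could be sketched without any new dependency by looking at the file extension. This is a different technique from content-based detection with Magika, and the mapping and the name guess_language are hypothetical.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a splitter language name.
EXTENSION_TO_LANGUAGE = {
    ".md": "markdown",
    ".py": "python",
    ".go": "go",
    ".java": "java",
    ".tex": "latex",
    ".html": "html",
}


def guess_language(filename: str, default: str = "markdown") -> str:
    """Return a language name for the file, falling back to markdown."""
    suffix = Path(filename).suffix.lower()
    return EXTENSION_TO_LANGUAGE.get(suffix, default)


print(guess_language("main.go"))
print(guess_language("report.pdf"))
```

Extension lookup is cheap but easy to fool, which is one reason a content-based detector may still be worth adding later.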

How do you handle chunks that are larger than the provided server_ctx_size ?

This PR does not change the logic for handling oversized chunks. "CharacterTextSplitter will only split on separator (which is '\n\n' by default). chunk_size is the maximum chunk size that will be split if splitting is possible." See this StackOverflow Post.
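
The quoted behaviour can be illustrated with a simplified, pure-Python imitation of separator-only splitting. This is not LangChain's implementation, and split_on_separator is a hypothetical name. It shows that a piece containing no separator stays in one chunk even when it is longer than chunk_size.

```python
from typing import List


def split_on_separator(text: str, chunk_size: int, separator: str = "\n\n") -> List[str]:
    """Split only at the separator, merging pieces while they fit."""
    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            # Either nothing is buffered yet or the merged text still fits.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "short\n\n" + "x" * 50 + "\n\ntail"
chunks = split_on_separator(text, chunk_size=20)
print([len(c) for c in chunks])
```

The middle chunk has 50 characters even though chunk_size is 20, because there is no separator inside it to split on.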

Is it assumed that if the file is not a code file, it is always markdown?

My justification is that our most common use case, as we discussed, is PDFs converted into markdown, so the default case should be markdown. Furthermore, when the language param is specified, RecursiveCharacterTextSplitter uses the following separators:

RecursiveCharacterTextSplitter.get_separators_for_language(Language.MARKDOWN)

Output:
['\n#{1,6} ',
 '```\n',
 '\n\\*\\*\\*+\n',
 '\n---+\n',
 '\n___+\n',
 '\n\n',
 '\n',
 ' ',
 '']

Instead of these separators.

["\n\n", "\n", " "]
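
To see why the markdown-aware separator list matters, here is a simplified sketch of priority-ordered splitting. It is an assumption-laden imitation rather than LangChain's algorithm: it uses a shortened separator list, never merges small pieces, and the name recursive_split is hypothetical. It tries heading boundaries first and only falls back to finer separators for pieces that are still too long.

```python
import re
from typing import List

# Shortened, illustrative subset of the markdown separators listed above.
MARKDOWN_SEPARATORS = [r"\n#{1,6} ", r"\n\n", r"\n", r" "]


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split at the first separator, recursing with the rest where needed."""
    if len(text) <= chunk_size or not separators:
        return [text]
    head, rest = separators[0], separators[1:]
    # The lookahead keeps each separator at the start of the following piece.
    pieces = [p for p in re.split(f"(?={head})", text) if p]
    chunks: List[str] = []
    for piece in pieces:
        chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks


doc = "# A\nintro text\n## B\nmore text here\n## C\nend"
chunks = recursive_split(doc, MARKDOWN_SEPARATORS, chunk_size=25)
print(chunks)
```

With heading-aware separators each section stays together; with only "\n\n", "\n" and " " the same document would be cut at line breaks with no regard for section boundaries.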

Original PR, see: #45

abhi1092 · 2024-07-02T20:01:35Z

Tested this PR. Chunking seems to be working as expected. @PalmPalm7 as discussed, can you add handling for the case where the document is not passed as a list?

PalmPalm7 · 2024-07-02T20:15:01Z

Updated chunking method with this logic. Ready to merge @abhi1092 .
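
A minimal sketch of that kind of input handling, assuming a single string should be wrapped in a list and any other type rejected. The helper name normalize_documents is hypothetical and the actual logic in chunking.py may differ.

```python
from typing import List, Union


def normalize_documents(documents: Union[str, List[str]]) -> List[str]:
    """Accept a single document or a list of documents and return a list."""
    if isinstance(documents, str):
        # A lone string is treated as one document, not as a list of characters.
        return [documents]
    if not isinstance(documents, list):
        raise TypeError(
            f"documents must be a str or a list of str, got {type(documents).__name__}"
        )
    return documents


print(normalize_documents("# Title\nbody"))
print(normalize_documents(["doc one", "doc two"]))
```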

russellb

Can you please squash your 3 commits into 1 and add all of the description you have on the PR into the commit message? Let me know if you need any assistance doing this!

PalmPalm7 · 2024-07-02T20:38:56Z

Thank you Russell! I have squashed and amended my commits. Ready to merge @russellb .

russellb

looks good -- passing e2e CI, and I like that it falls back to the old method in case of an error.

Instead of a raw print() I would prefer using a logger. Can you add a new parameter for logger, update consumers to pass in a logger, and then use that for the messages you're printing? Thanks!

src/instructlab/sdg/utils/chunking.py

PalmPalm7 · 2024-07-02T21:51:29Z

Done! @russellb

I've added this parameter:
logger = logging.getLogger(__name__) and replaced all existing print functions.

russellb

lgtm, nice work!

(you could drop "resolved merge conflict" from the PR title / commit message, but it's not a big deal, and not worth the CI reset probably)

src/instructlab/sdg/utils/chunking.py

RobotSail

Some minor comments but LGTM overall, great work!

markmc · 2024-07-03T13:19:08Z

Updates:

Used an efficient library to detect file type.

This new dependency is coming at a time when we are in a "soft dependency" freeze - i.e. we are avoiding adding new dependencies, unless there is some exceptional reason to justify it

If we did not add this library now (i.e. waited until the next milestone release has been completed), I guess we would just use language=Language.MARKDOWN and have no intelligent support for code files (go, java, etc.), latex, and html?

IMO changing from the current naive list of separators to language=Language.MARKDOWN is a good change to make now, but adding the new dependency can wait since support for those other file formats isn't a major priority right now

russellb · 2024-07-03T13:39:45Z

That makes a ton of sense. Thanks for thinking this through, Mark.

Markdown is the only knowledge doc format supported via the CLI right now, so it's the only format we need to support at the moment.

russellb

Changing review based on Mark's feedback. I think he's right about the dependency and what features are required right now vs. can come later.

markmc · 2024-07-03T13:49:36Z

@PalmPalm7 if you split the magika part into a separate PR, we could merge the rest as a super useful improvement to the markdown support 👍

dhellmann · 2024-07-03T13:59:39Z

+1 -- magika introduces a dependency on onnxruntime, which is another package that doesn't publish standard source code packages so we'll have to figure out how to build it to prepare a release downstream. It would be nice if we could wait to bring that in until this first major release is settled.

PalmPalm7 · 2024-07-03T14:04:47Z

Thank you for taking the time to review and foresee potential risks @dhellmann @markmc @russellb , I'll split it into two PRs.

PalmPalm7 · 2024-07-03T15:52:53Z

Updated logic according to Mark and Doug's suggestions above. Ready to merge @russellb .

1. Applied the document-specific text splitter from Langchain in place of the original naive version.
2. Made heuristic changes for markdown files, notably using regex to trim markdown tables so that a whole table is more likely to fit within the limited context window.
3. For the updated chunk_document() function, see Chunking_Demo.ipynb, which demonstrates chunking with server_ctx_size=4096 and chunk_word_count=1024. Granite 7b has a 4k context window.

Signed-off-by: Andy Xie <[email protected]>

dhellmann · 2024-07-03T16:18:41Z

Thank you for being flexible!

oindrillac · 2024-07-05T17:25:09Z

@russellb can you please check if the changes addressed your concern above?

russellb · 2024-07-06T17:44:52Z

src/instructlab/sdg/utils/chunking.py

    )

+    # Determine file type for heuristics, default with markdown


looks like this comment may be out of date, but not a big deal

This was referenced Jul 2, 2024:

Updated Chunking Methods #45 (Closed)
Update chunking for knowledge documents #34 (Closed)
Further update chunking strategies to improve performance. #66 (Closed)

russellb requested changes Jul 2, 2024

PalmPalm7 force-pushed the main branch from 693ee6c to 513c669 Compare July 2, 2024 20:37

russellb mentioned this pull request Jul 2, 2024: server_ctx_size is ignored in document chunking method #69 (Closed)

russellb requested changes Jul 2, 2024

PalmPalm7 force-pushed the main branch from 513c669 to 4558c37 Compare July 2, 2024 21:42

mergify bot added the ci-failure label Jul 2, 2024

PalmPalm7 force-pushed the main branch from 4558c37 to d043130 Compare July 2, 2024 21:46

mergify bot removed the ci-failure label Jul 2, 2024

russellb approved these changes Jul 2, 2024

abhi1092 approved these changes Jul 2, 2024